Abstract:
Recent heavy rainfall-induced flood events, for example in Germany, Australia, and the USA, have highlighted the relevance of countermeasures in saving human lives and preventing property damage. Newly introduced ML-based flood forecasting methods rely on high-intensity synthetic rainfall events due to the sparsity of their real counterparts. Such synthetic data instances can be produced by precipitation generators trained in an adversarial setting on historical rainfall data. The capturing processes for rainfall data are often highly distributed, with multiple radar stations contributing to a centralised data set. However, data centralisation entails challenges regarding data-stream logistics, data locality, and memory overhead. Distributed Analytics (DA) aims to overcome these challenges through decentralised model training by bringing the algorithm to the data instead of vice versa. In this work, we propose a feasibility study evaluating the applicability of DA to hydrological data. As an example use case, we choose the decentralised training of rainfall data generators. We introduce a rainfall generator training procedure relying on Generative Adversarial Networks (GANs) and evaluate two DA algorithms: Federated Learning (FL) and Cyclic Institutional Incremental Learning (CIIL). We compare the resulting training outcomes with the centralised model training (CL) approach and find that CIIL performed similarly to CL but was less stable, while FL outperformed CL by 7.5%. We conclude that the proven feasibility of FL in our simulated distributed setting lays the groundwork for utilising this approach in larger-scale realistic environments while overcoming potential privacy concerns and logistical challenges associated with centralised analytics.
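The FL setting described above typically aggregates locally trained model weights at a coordinator. The following is a minimal sketch of FedAvg-style weighted averaging, assuming the paper's setup resembles standard Federated Learning; the function name and the toy weight vectors are illustrative, not taken from the paper:

```python
# Minimal sketch of Federated Averaging (FedAvg): each station trains a
# local copy of the generator, and a coordinator averages the weights,
# weighting each station by how much training data it holds.

def federated_average(local_weights, num_samples):
    """Weighted average of per-station weight vectors.

    local_weights: list of weight vectors (one per station)
    num_samples:   list of sample counts held by each station
    """
    total = sum(num_samples)
    n_params = len(local_weights[0])
    global_weights = [0.0] * n_params
    for weights, n in zip(local_weights, num_samples):
        for i, w in enumerate(weights):
            global_weights[i] += w * (n / total)
    return global_weights

# Two stations with unequal data volumes: the larger contributes more.
avg = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
# avg == [2.5, 3.5]
```

In a real GAN setting, the averaged vector would be the flattened generator (and discriminator) parameters, and the averaging would repeat every communication round.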
Abstract:
Index data distribution is an important approach that provides parallelism and can improve the usability of a distributed parallel database. The B+ tree is a storage structure well suited to distributed and parallel indexing, and the distributed B+ tree is adopted to index the massive and rapidly growing data available in a distributed network. This paper proposes an index data distribution strategy using a distributed parallel B+ tree in a distributed network environment. In our proposal, the basic data distribution strategy improves query efficiency by utilizing a data fragmentation method based on value ranges, and the replica distribution can be adjusted dynamically according to the number of system accesses. The performance evaluation and experimental results show that this index data distribution strategy can improve query efficiency and load balance.
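Value-range fragmentation, as mentioned above, routes each key to the node whose range contains it, so range scans touch only the overlapping fragments. A hedged sketch follows; the class, boundaries, and node names are invented for illustration and do not come from the paper:

```python
import bisect

# Illustrative range ("scope of value") partitioner: keys are routed to
# the node responsible for the value range containing them.

class RangePartitioner:
    def __init__(self, boundaries, nodes):
        # boundaries: sorted upper bounds; len(nodes) == len(boundaries) + 1
        self.boundaries = boundaries
        self.nodes = nodes

    def node_for(self, key):
        # First fragment whose upper bound exceeds the key.
        return self.nodes[bisect.bisect_right(self.boundaries, key)]

    def nodes_for_range(self, lo, hi):
        # A range scan only contacts fragments overlapping [lo, hi],
        # which is what makes range fragmentation query-efficient.
        i = bisect.bisect_right(self.boundaries, lo)
        j = bisect.bisect_right(self.boundaries, hi)
        return self.nodes[i:j + 1]

p = RangePartitioner([100, 200], ["node-A", "node-B", "node-C"])
p.node_for(150)            # "node-B"
p.nodes_for_range(50, 150)  # ["node-A", "node-B"]
```

In a distributed B+ tree, the same idea applies at the leaf level: each server owns a contiguous key range, and the dynamic replica adjustment the abstract mentions would add copies of hot ranges.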
Abstract:
Well-developed ascospores of Rinodina flavosoralifera have been observed for the first time. The ascospores are described and illustrated and new data on the chemistry and distribution of this species are provided. New chorological data on Rinodina disjuncta are also included.
Abstract:
End-user Cloud storage is increasing rapidly in popularity in research communities thanks to the collaboration capabilities it offers, namely synchronisation and sharing. CERN IT has implemented a model of such storage named CERNBox, integrated with the CERN AuthN and AuthZ services. To exploit the use of end-user Cloud storage for the distributed data analysis activity, the CMS experiment has started the integration of CERNBox as a Grid resource. This will allow CMS users to make use of their own storage in the Cloud for their analysis activities as well as to benefit from synchronisation and sharing capabilities to achieve results faster and more effectively. It will provide an integration model of Cloud storage in the Grid, implemented and commissioned over the world's largest computing Grid infrastructure, the Worldwide LHC Computing Grid (WLCG). In this paper, we present the integration strategy and infrastructure changes needed in order to transparently integrate end-user Cloud storage with the CMS distributed computing model. We describe the new challenges faced in data management between Grid and Cloud and how they were addressed, along with details of the support for Cloud storage recently introduced into the WLCG data movement middleware, FTS3. The commissioning experience of CERNBox for the distributed data analysis activity is also presented.
Highlights:
• A model for the integration of end-user Cloud storage in a Grid infrastructure for distributed data analysis is proposed.
• The model relies on changes in the Grid infrastructure and developments in the data movement middleware.
• The model shows good performance when moving data between Grid and Cloud at scale.
Abstract:
Autoplot is software developed for the Virtual Observatories in Heliophysics to provide intelligent and automated plotting capabilities for many typical data products that are stored in a variety of file formats or databases. Autoplot has proven to be a flexible tool for exploring, accessing, and viewing data resources as typically found on the web, usually in the form of a directory containing data files with multiple parameters in each file. Data from a data source is abstracted into a common internal data model called QDataSet. Autoplot is built from individually useful components and can be extended and reused to create specialized data handling and analysis applications; it is already used in a variety of science visualization and analysis applications. Although originally developed for viewing heliophysics-related time series and spectrograms, its flexible and generic data representation model makes it potentially useful for the Earth sciences.
Abstract:
Bathyllia plumbea n. gen., n. sp. and Enervia elongata n. sp., from Australia, Eachamia bismarckiana n. sp. from Bismarck Island, Thyreocephalus balianus n. sp. and Metolinus balianus n. sp. from Bali are described, and distributional data are presented.
Abstract:
Heterogeneous mobile, sensor, IoT, smart environment, and social networking applications have recently started to produce unbounded, fast, and massive-scale streams of data that have to be processed "on the fly". Systems that process such data have to be enhanced with detection for operational exceptions and with triggers for both automated and manual operator actions. In this paper, we illustrate how tracing in distributed data processing systems can be applied to detecting changes in data and the operational environment to maintain the efficiency of heterogeneous data stream processing systems under potentially changing data quality and distribution. By tracing individual input records, we can (1) identify outliers in a web crawling and document processing system and use the insights to define URL filtering rules; (2) identify heavy keys, such as NULL, that should be filtered before processing; (3) give hints to improve the key-based partitioning mechanisms; and (4) measure the limits of overpartitioning if heavy thread-unsafe libraries are imported. Using Apache Spark as an illustration, we show how various data stream processing efficiency issues can be mitigated or optimized by our distributed tracing engine. We describe and qualitatively compare two different designs, one based on reporting to a distributed database and another based on trace piggybacking. Our prototype implementation consists of wrappers suitable for JVM environments in general, with minimal impact on the source code of the core system. Our tracing framework is the first to solve tracing in multiple systems across boundaries and to provide detailed performance measurements suitable for automated optimization, not just debugging. (C) 2018 Elsevier B.V. All rights reserved.
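One use case listed above is identifying heavy keys (such as NULL) that skew key-based partitioning. A heavily simplified sketch of that idea, assuming traced records arrive as (key, payload) pairs; the function and threshold are illustrative, not the paper's actual tracing engine:

```python
from collections import Counter

# Count records per key and flag keys that account for more than a given
# fraction of the stream; such keys would overload a single partition
# under hash partitioning and are candidates for pre-filtering.

def heavy_keys(records, threshold):
    """Return keys whose record count exceeds threshold * total records."""
    counts = Counter(key for key, _ in records)
    total = len(records)
    return {k for k, c in counts.items() if c > threshold * total}

records = [(None, 1), (None, 2), (None, 3), ("a", 4), ("b", 5)]
# With a 50% threshold, only the NULL key is flagged.
flagged = heavy_keys(records, 0.5)
```

In the distributed setting the counts would be maintained per worker and merged, but the decision rule is the same.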
Abstract:
Data collection is required to be both safe and efficient, considering data privacy as well as system performance. In this paper, we study a new problem: distributed data sharing with privacy-preserving requirements. Given a data demander requesting data from multiple distributed data providers, the objective is to enable the data demander to access the distributed data without learning the private data of any individual provider. The problem is challenged by two questions: how to transmit the data safely and accurately, and how to handle data streams efficiently. As a first study, we propose a practical method, Shadow Coding, to preserve privacy in data transmission and ensure recovery in data collection, achieving privacy-preserving computation in a data-recoverable, efficient, and scalable way. We also provide practical techniques to make Shadow Coding efficient and safe for data streams. An extensive experimental study on a large-scale real-life dataset offers insight into the performance of our scheme. The proposed scheme is also implemented as a pilot system in a city to collect distributed mobile phone data.
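The abstract does not spell out Shadow Coding's construction, so the sketch below illustrates only the general idea of data-recoverable, privacy-preserving collection, using additive secret sharing as a stand-in; it is not the paper's actual scheme, and all names are invented:

```python
import random

# Each provider splits its value into random shares that sum back to the
# value; the demander combines share sums and recovers only the aggregate,
# never an individual provider's value.

def make_shares(value, n_shares, rng):
    shares = [rng.randint(-1000, 1000) for _ in range(n_shares - 1)]
    shares.append(value - sum(shares))  # shares sum back to the value
    return shares

def collect(all_shares):
    # The demander sees only shares, not original values.
    return sum(sum(shares) for shares in all_shares)

rng = random.Random(0)
providers = [10, 20, 30]
shared = [make_shares(v, 3, rng) for v in providers]
total = collect(shared)  # recovers 60 without exposing 10, 20, or 30
```

The "accurate recovery" property holds by construction, since each provider's shares sum exactly to its value regardless of the random masks.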
Abstract:
Data analysts explore data by inspecting features such as clustering, distribution and correlation. Much existing research has focused on different visualisations for different data exploration tasks. For example, a data analyst might inspect clustering and correlation with scatterplots, but use histograms to inspect a distribution. Such visualisations allow an analyst to confirm prior expectations. For example, a scatterplot may confirm an expected correlation or may show deviations from the expected correlation. In order to better facilitate discovery of unexpected features in data, however, a combination of different perspectives may be needed. In this paper, we combine distributional and correlational views of hierarchical multidimensional data. Our unified view supports the simultaneous exploration of data distribution and correlation. By presenting a unified view, we aim to increase the chances of discovery of unexpected data features, and to provide the means to explore such features in detail. Further, our unified view is equipped with a small number of primitive interaction operators which a user composes to facilitate smooth and flexible exploration.